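Dataset | 5.6 Million Reviews of 300k Podcasts (2005-2023)
1. Dataset Overview
Media: Podcast
Data source: https://podcasts.apple.com/
Coverage: 2005-12-10 ~ 2023-03-07
Number of podcast IDs: 303911
Number of reviews: 5607021
Fields: podcast_id, title, content, rating, author_id, created_at, category, etc.
The dataset is large and rich in fields, which makes it suitable for research in sociology, journalism and communication, linguistics, economics, management, and related fields.
2. Reading the Data
Read each file with pandas.read_json(), passing lines=True so that every line of the file is parsed as one JSON record.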
2.1 podcasts.json
import pandas as pd
pdf = pd.read_json('podcasts.json', lines=True)
# Inspect the fields of podcasts.json
print(pdf.columns)
pdf
Run
Index(['podcast_id', 'itunes_id', 'slug', 'itunes_url', 'title', 'author',
'description', 'average_rating', 'ratings_count', 'scraped_at'],
dtype='object')
2.2 categories.json
cdf = pd.read_json('categories.json', lines=True)
# Inspect the fields of categories.json
print(cdf.columns)
cdf
Run
Index(['podcast_id', 'itunes_id', 'category'], dtype='object')
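Since podcasts.json and categories.json both carry podcast_id, the two tables can be joined on that column. The following is a minimal sketch on made-up stand-in frames (pdf_demo and cdf_demo are hypothetical names, and the rows are invented); it assumes a podcast may have several category rows, which the source does not state.

```python
import pandas as pd

# Tiny stand-ins for pdf (podcasts.json) and cdf (categories.json); all values are made up.
pdf_demo = pd.DataFrame({
    "podcast_id": ["a1", "b2", "c3"],
    "title": ["Show A", "Show B", "Show C"],
})
cdf_demo = pd.DataFrame({
    "podcast_id": ["a1", "a1", "b2"],  # assumption: one podcast can have several categories
    "category": ["news", "business", "comedy"],
})

# A left join keeps every podcast, including those without a category row (category is NaN).
merged = pdf_demo.merge(cdf_demo[["podcast_id", "category"]], on="podcast_id", how="left")

# Number of distinct podcasts in each category.
per_category = merged.groupby("category")["podcast_id"].nunique().sort_values(ascending=False)

print(merged)
print(per_category)
```

With the real data, pdf and cdf would take the place of the demo frames; only the key column podcast_id is relied on here.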
2.3 reviews.json
rdf = pd.read_json('reviews.json', lines=True)
# Inspect the fields of reviews.json
print(rdf.columns)
rdf
Run
Index(['podcast_id', 'title', 'content', 'rating', 'author_id', 'created_at'],
dtype='object')
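With about 5.6 million records, reviews.json may not fit comfortably in memory on every machine. One possible workaround, sketched here, is to let pandas.read_json() return chunks via its chunksize parameter (only valid together with lines=True). The helper count_ratings and the temporary demo file are hypothetical and exist only for illustration.

```python
import json
import os
import tempfile
from collections import Counter

import pandas as pd


def count_ratings(path, chunksize=100_000):
    """Tally the rating column of a JSON Lines file without loading it all at once."""
    counter = Counter()
    # With lines=True and chunksize set, read_json yields DataFrames of at most chunksize rows.
    with pd.read_json(path, lines=True, chunksize=chunksize) as reader:
        for chunk in reader:
            counter.update(chunk["rating"].value_counts().to_dict())
    return pd.Series(counter).sort_index()


# Demo on a small temporary file with made-up records.
records = [
    {"podcast_id": "a1", "rating": 5},
    {"podcast_id": "a1", "rating": 4},
    {"podcast_id": "b2", "rating": 5},
]
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False, encoding="utf-8") as fh:
    for rec in records:
        fh.write(json.dumps(rec) + "\n")
    demo_path = fh.name

rating_counts = count_ratings(demo_path, chunksize=2)
os.remove(demo_path)
print(rating_counts)
```

On the real file the call would be count_ratings('reviews.json'); the chunk size is a tuning knob, not a value taken from the source.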
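3. Experiments
3.1 Filtering podcast names by keyword
Filter podcasts.json for podcasts whose title contains China. To match 中国 as well, use the pattern 'China|中国' instead.
china_podcast_df = pdf[pdf['title'].fillna('').str.contains('China')]
china_podcast_df
# View the names of these 86 podcasts
print(china_podcast_df.title.values)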
Run
['China Arts Podcast'
'Made in China Podcast: International Business | Crowdfunding | Entrepreneurship'
'Chinasource Recently Added Resources' 'TIC China Network' 'UNDP China'
'Wellness in China' 'Party In China' 'Tails From China' 'Focus on China'
'CEIBS China Knowledge' 'Bottled in China' 'Environment China'
'China Money Podcast - Audio Episodes'
'China Money Podcast - Video Episodes'
'China Jedi Podcast: Expat Life | Chinese Culture | Business | Travel | China'
'China Digital Marketing Podcast' 'Goodbye China Podcast'
......
'Made In China - Noel Smith' 'Offices at China Hall Coworking Podcast'
'Carved To China' 'Made in China' 'China Innovation Decoded'
'Made in China' 'U.S.-China Dialogue Podcast' 'Falando de China'
'China Business Minute' 'Linfen China' 'Young China Watchers'
'China Business Review' 'Podbabes China' 'China Design Now (English)'
'McKinsey Greater China' 'Governing China'
'U.S./China Media Brief Program - Interviews' 'The China History Podcast'
'RailsCasts China' 'Behind the Great Firewall of China Podcast'
"China Now's Podcast" 'China: As History Is My Witness'
'Safeguarding Dunhuang for China and the World' 'Biz China'
'Chinaman Talks Sports' 'China in the World' 'The History of China'
"Forbidden City: Inside the Court of China's Emperors"
'NAFTA at Twenty: Trade, Transformation and the China Factor'
'NAFTA at Twenty: Trade, Transformation and the China Factor (Audio Only)'
'China and the Chinese by Herbert Allen Giles' 'China Doing Sweden'
'China MSG' 'Yellow Star: China News' 'Made in China']
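The match above is a case-sensitive substring test, so it also returns titles such as 'Chinasource ...' and 'Chinaman Talks Sports' and would miss a lowercase 'china'. A stricter variant, sketched below on made-up titles, uses the case, na, and regex parameters of Series.str.contains together with word boundaries. Whether the stricter rule suits a given study is a judgment call, not something the source prescribes.

```python
import pandas as pd

# Made-up titles standing in for pdf['title']; None mimics a missing title.
titles = pd.Series([
    "The China History Podcast",
    "Chinaman Talks Sports",
    "made in china",
    None,
    "中国好声音",
])

# Substring match as in the code above: case-sensitive, also hits 'Chinaman'.
# na=False treats missing titles as non-matches, so fillna('') is not needed.
substring = titles.str.contains("China", na=False)

# Whole-word, case-insensitive match for the English keyword, plus the Chinese keyword.
pattern = r"\bchina\b|中国"
whole_word = titles.str.contains(pattern, case=False, na=False, regex=True)

print(titles[substring].tolist())
print(titles[whole_word].tolist())
```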
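3.2 Filtering content titles by keyword
Filter for titles containing China or 中国. Note that a podcast's title (in podcasts.json) stays the same, whereas the title field in reviews.json differs from record to record.
# Filter reviews.json for records whose title contains China or 中国
china_title_df = rdf[rdf['title'].fillna('').str.contains('China|中国')]
china_title_df
# Print the matched titles
print(china_title_df.title.values)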
Run
["What's a China?" 'Thanks Justin - from China'
'American Working in China Coffee Industry' 'Babybee in China'
'Listening From China!!' 'Right on China.' 'Excellent China Series!'
'China Trade War episode was fantastic'
'Really enjoyed the China / Tariff discussion' 'China Review'
'Beautiful videos of China!' 'Learn about The Real China business'
'Doing business in China? Listen to this!' 'China'
"Insightful look into China's growing influence"
'Heavy hitters share their views on China' 'Huawei (or China)'
'Emergency China podcast was unreal' 'China Episode' 'China'
'矮大紧老师的确是现代中国文化圈里面的高山晓辉里的奇松' 'Love the China rant' '中国好'
'Band in China' '关于中国生活有趣的观点' 'Deep and personal angle to look at China'
'Saying hi from China' '终于有一档中国记者做的播客' 'China’s’ Detention Camps'
'China Tech Insider' 'She-G-string. king of China'
......
......
'Required listening to keep up with contemporary China'
'Most antiChina guests and content' 'Fantastic China-centric podcast'
'The best Podcast on China-related topics' 'Big trouble in little China'
'中国最好的游戏广播。' '中国第一家做游戏广播的!!' 'The best game radio in China!'
'Best Podcast on China’s History'
'Great China Insights and interview topics'
'Great new Content on China and Sede Vacante' '没有中国特色'
'Into China Marketing? This is the Podcast!'
'“You can’t be angry at China for the lab leak...”'
'China and the risk of nuclear conflict' 'The China threat'
'The second-best China-Africa podcast there is!'
'Review of Mosaic of China'
'Really interesting look at the people who live in China'
'A good way to peek inside of China!' 'How can we be more like China?'
'A must listen for China policitics nerds!'
'Great way to keep up on the EU-China relationship'
'Interesting and informative podcast on China'
'SCTV from the South China Sea' 'China and Omicron' 'Strangers in China'
'China seems very scary' 'China Lockdown'
'I travel to China regularly just to listen'
'Best American News I Can Find in China!!!!']
3.3 Filtering reviews by keyword
# Filter reviews.json for records whose content contains China or 中国
china_reviews_df = rdf[rdf['content'].fillna('').str.contains('China|中国')]
china_reviews_df
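Once the keyword reviews are selected, a natural next step is to see how their number and average rating change over time. The sketch below uses made-up rows (rdf_demo is a hypothetical stand-in for rdf) and assumes created_at holds ISO 8601 timestamps; the actual format in reviews.json is not shown in the source.

```python
import pandas as pd

# Made-up rows standing in for rdf; created_at is assumed to hold ISO 8601 timestamps.
rdf_demo = pd.DataFrame({
    "content": ["Great China series", "Nice show", "关于中国的节目", "China again"],
    "rating": [5, 4, 3, 1],
    "created_at": [
        "2019-05-01T10:00:00-07:00",
        "2019-06-01T10:00:00-07:00",
        "2020-01-15T08:30:00-07:00",
        "2020-03-02T12:00:00-07:00",
    ],
})

# Same keyword filter as above; .copy() avoids modifying a view of the original frame.
hits = rdf_demo[rdf_demo["content"].str.contains("China|中国", na=False)].copy()

# Convert to UTC so that timestamps with different offsets share one dtype.
hits["year"] = pd.to_datetime(hits["created_at"], utc=True).dt.year

# Review count and mean rating per year.
per_year = hits.groupby("year")["rating"].agg(["count", "mean"])
print(per_year)
```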
4. How to Obtain
200 yuan. Add WeChat 372335839 with the note [Name-School-Major-Podcast].